ABSTRACT

We focus on two research issues in entity search: how to 
score a document or snippet that potentially supports a can- 
didate entity, and how to aggregate or combine scores from 
different snippets into an entity score. Proximity scoring 
has been studied in IR outside the scope of entity search. 
However, aggregation has been hardwired except in a few 
cases where probabilistic language models are used. We 
instead explore simple, robust, discriminative ranking algo- 
rithms, with informative snippet features and broad families 
of aggregation functions. Our first contribution is a study of 
proximity-cognizant snippet features. In contrast with prior 
work which uses hardwired "proximity kernels" that imple- 
ment a fixed decay with distance, we present a "universal" 
feature encoding which jointly expresses the perplexity (in- 
formativeness) of a query term match and the proximity of 
the match to the entity mention. Our second contribution 
is a study of aggregation functions. Rather than train the 
ranking algorithm on snippets and then aggregate scores, 
we directly train on entities such that the ranking algorithm 
takes into account the aggregation function being used. Our 
third contribution is an extensive Web-scale evaluation of 
the above algorithms on two data sets having quite different 
properties and behavior. The first one is the W3C dataset 
used in TREC-scale enterprise search, with pre-annotated 
entity mentions. The second is a Web-scale open-domain 
entity search dataset consisting of 500 million Web pages, 
which contain about 8 billion token spans annotated auto- 
matically with two million entities from 200,000 entity types 
in Wikipedia. On the TREC dataset, the performance of our 
system is comparable to the currently prevalent systems by 
Balog et al. (using Boolean associations) and Macdonald
et al. On the much larger and noisier Web dataset, our sys- 
tem delivers significantly better performance than all other 
systems, with 8% MAP improvement over the closest com- 
petitor. 

1. INTRODUCTION 

In its simplest form, entity search queries provide a type 
(e.g., scientist) and ask for entities that belong to that type 
and satisfy other properties, expressed through keywords 
(played violin). Entity search is a prime example of searching
the "Web of Objects", or going from "strings to things",
pursued currently by all major search engines.

Machine learning in general, and learning to rank (L2R) 
[24] in particular, can be brought to bear on entity search
in two key interrelated issues: 
• How should a context be scored wrt the query? Specifically,
what form of scoring function will take into account the
perplexity (rarity) of query words and their proximity to
(mentions of) a candidate entity in a general, trainable
fashion?

• How should evidence from many contexts be aggre- 
gated into a score or rank for an entity they support? 
Can the context scoring model be learnt without con- 
text labels by directly optimizing for entity scores or 
ranks? 

Several existing formulations tackle the two issues separately. 

1.1 Proximity scoring 

Scoring documents and passages taking query word match 
proximity into account is well established [6, 30, 25] in IR,
but largely outside the domain of entity search. Some systems
[11, 22] use hardwired proximity scoring for entity
search, without using L2R. Recently, "proximity kernels"
have been used [28] in entity search based on generative
language models, with tunable width parameters. However, 
we know of no end-to-end L2R system where the proximity 
scoring function is itself learnt from entity relevance judg- 
ments. As we shall see here, the issue of robust, trainable 
proximity scoring is far from closed. 

1.2 Evidence aggregation 

With very few exceptions [15, 27], entity and expert search
algorithms in the IR community are heavily biased toward
generative language models [12, 1, 14, 2]. In contrast, some
of the best-known L2R algorithms use discriminative max- 
margin techniques [T7| [l8 [TgI [il [5| [9| [34| or conditional 
probability formulations [7, 32]. In Web search, the best
L2R algorithms are believed to perform considerably bet- 
ter than hardwired scoring functions from early IR systems. 
And yet, entity and expert search have benefited little from 
L2R techniques. 

A likely reason is the following gap in the respective mod- 
els. In learning to rank (L2R), each item to be ranked is 
represented by one feature vector. In entity search, each 
item is an entity, potentially supported by many contexts, 
which may be short token sequences or entire documents. 
Each context, not entity, is associated with a feature vector. 
On the other hand, it is far easier to get entity relevance 
judgments than context relevance judgments. 

Owing to distributional assumptions, probabilistic retrieval 
models [12, 1, 14, 28, 15] hardwire the manner in which individual
context scores contribute to the score (and thereby rank)
of an entity. As we shall see in Section 2, these forms of
aggregation have certain limitations. In later work, Balog
et al. [2] allowed non-probabilistic aggregations. Macdonald
et al. [26, 27] were the first to systematically explore a family
of aggregation functions and use them as features in an L2R
setting. They also used hand-crafted rank cutoffs to eliminate
noisy or unreliable support contexts. Cummins et al.
[13] used a genetic algorithm to find a soft rank cutoff.

1.3 Our contributions 

We started with the goal of unifying hitherto unconnected
work on L2R, proximity scoring and evidence aggregation
into a simple and uniform learning framework. It turned out 
that the new framework is also more robust across diverse 
data sets, matching or beating all known systems. 

In Section 3 we explore feature design. In contrast with
earlier proximity kernel [28, 25] approaches that combine
a generative language model with a decay function having 
tuned width parameters, we propose a very general frame- 
work for feature design that encodes information about the 
rarity (also called "perplexity", often measured via inverse 
document frequency) of query words matched in a context, 
as well as their distance from the candidate entity mention. 
In particular, we do not combine these two signals in a hard- 
wired manner. 

In Section 4 we explore trainable evidence aggregation. In
past work, only Fang et al. [15] proposed a document scoring
model that was trained using end-to-end entity relevance
judgments. We propose a family of pairwise ranking loss [24]
optimization problems to deal uniformly with a variety of 
context score aggregation functions. 

In Section 5 we present a detailed experimental study of
the above approaches using two data sets. The first is the
W3C dataset from the TREC expert search task, used in
many earlier papers. This corpus has under 350,000 documents
from the W3C Web site, with six different types of
web pages (emails, code, wiki, personal homepages, web and
misc). Since the dataset was used for the enterprise search track,
there is only one entity type: person. We performed no special
processing for specific types of pages. The query set
for this dataset contains 50 and 49 "topics" from the TREC
2005 and 2006 enterprise tracks. Relevance judgments were
also provided, with about 4400 relevant candidates for the 99
queries. To facilitate standardization, we used the annotated
version of the W3C dataset prepared by Jianhan Zhu, available
from https://ir.nist.gov/w3c/contrib/W3Ctagged.html,
containing about 1.6 million annotations.

The second corpus is a representative Web crawl from a 
commercial search engine, with 500 million spam-free En- 
glish documents. Token spans that are likely entity mentions 
are annotated in advance with IDs from among two million 
entities belonging to over 200,000 types from YAGO [29]. 
These annotations (about 8 billion) are then indexed along 
with text. We use 845 entity search queries collected from 
many years of TREC and INEX competitions, leading to 
93 million contexts supporting candidate entities. This is 
perhaps among the first Web scale entity ranking testbeds 
where all candidate contexts can be analyzed without de- 
pending on a black-box document-level ranking function with 
possibly extraneous scoring considerations like PageRank or 
click statistics. We will place our code and data in the public 
domain to promote Web-scale entity ranking research. 

1.4 Results 

• Purely probabilistic language models that use an expectation
over contexts lose vital signal in $|S_e|$, the
number of contexts supporting candidate e.

• However, perplexity + proximity features add further
statistically significant accuracy to just context count. 
Very simple features that encode perplexity (rarity) of 
query term matches and their proximity from the en- 
tity mention are better than fitting proximity kernels. 

• On TREC, a simple non-probabilistic sum-of-context-score
scheme [2, model 2, Boolean association] and a
voting scheme [27, 26] are competitive. However, our
system gives comparable performance.
• On the Web testbed, our system is statistically signifi- 
cantly superior to all prior systems. Thus, the two data 
sets behave differently. Our system is more robust to 
the larger corpus with noisy entity recognition. 

2. RELATED WORK 

We set up some uniform notation. A query is denoted q.
Here we will model q as a set of words and possibly phrases.
Some of these may be compulsory for a match, others are
optional. The set of candidate entities for q is denoted $E_q$,
dropping the subscript if unnecessary; $e \in E_q$ is a candidate
entity (in earlier work sometimes named c or cd).

A context supporting a candidate entity may be a whole 
document or a short span of tokens (which we call a snip- 
pet) approximately centered on a mention of the entity. An 
entity may be mentioned in multiple places in a document. 
Likewise, a query term may appear several times in a document,
or even in a snippet. In this section, we will use
$S_e$ to denote the set of contexts that potentially support e,
without committing on whether x is a document or snippet;
$x \in S_e$ is one context.

The dominant language modeling approaches find the score
of context x as $\prod_{t \in q} \Pr(t \mid \theta_x)$, and then aggregate these
somehow over x to find a score for e.

2.1 Scoring one supporting context 

Early expert search systems [1, 26] did not use proximity
signals, and instead scored the whole supporting document.
Proximity scoring outside expert search began around
the same time [6, 30] or later [25]. Petkova et al. [28] first
used proximity scoring in expert search using "kernels". A
proximity kernel $k(i, o)$ is a non-negative function of a term
offset i and an entity mention offset o, that decreases with
$|i - o|$. Instead of using terms from document x uniformly
to construct a language model $\Pr(t \mid \theta_x)$, they use k to construct
a position-sensitive language model $\Pr(t \mid \theta_x, o)$ where
the contribution of the term $t_i$ at offset i is scaled by $k(i, o)$.
Ranking accuracy is not very sensitive to the form of k; a
Gaussian centered at o works well.

Proximity kernels in generative language models. Note
that, by definition, $\sum_t \Pr(t \mid \theta_x, o) = 1$. Consider entities
$e_1, e_2$ supported by documents $x_1, x_2$. In $x_1$, $e_1$ is mentioned
at $o_1 = 10$, and $e_2$ in $x_2$ at $o_2 = 100$. Say there are
two query terms, and they occur at positions 8, 13 in $x_1$
and 80, 130 in $x_2$. Then the models $\Pr(t \mid \theta_x, o)$ will be identical
for $x = x_1, x_2$, and the absolute proximity information
will be lost. Based on only $x_1, x_2$, there would be no reason
to prefer $e_1$ over $e_2$. This is an important limitation of
proximity-based language models that has not been highlighted
before, and that warrants an investigation of purely
feature-based approaches, which we present in Section 3.
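
The loss of absolute proximity can be checked numerically on the positions of the example above. The sketch below is only an illustration under two assumptions that are not taken from the systems discussed: the probability mass is restricted to the two matched query terms, and the kernel is a scale-free inverse-distance function (the names `positional_model` and `inverse_distance` are hypothetical). With other kernel shapes the two normalized models need not coincide exactly, but normalization still discards the overall magnitude of the kernel weights.

```python
def positional_model(term_offsets, mention_offset, kernel):
    """Kernel-weighted, normalized distribution over the matched terms."""
    weights = [kernel(abs(i - mention_offset)) for i in term_offsets]
    total = sum(weights)
    return [wt / total for wt in weights]


def inverse_distance(d):
    # Assumed scale-free kernel (not the Gaussian of [28]); requires d >= 1.
    return 1.0 / d


# Mention at 10 with terms at 8 and 13, versus mention at 100 with terms at 80 and 130.
model_x1 = positional_model([8, 13], 10, inverse_distance)
model_x2 = positional_model([80, 130], 100, inverse_distance)
```

Both models come out at roughly (0.6, 0.4), although every match in the second document is ten times farther from the mention than in the first.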

2.2 Aggregating noisy evidence 

Balog et al. [1, 2] were among the first to popularize generative
language models, originally used in traditional IR [20],
in expert search. Their best model (which we call Balog2)
proceeds as $\Pr(q \mid e) = \sum_{x \in S_e} \Pr(q \mid x, e) \Pr(x \mid e)$. This leads
to a sum-product form:

    $$\Pr(q \mid e) = \sum_{x \in S_e} \Bigl( \prod_{t \in q} \Pr(t \mid x, e) \Bigr) \Pr(x \mid e).$$    (SumProd)
The event space associated with $\Pr(x \mid e)$ has been somewhat
murky; in particular, if an estimate is used such that
$\sum_{x \in S_e} \Pr(x \mid e) = 1$, (SumProd) effectively becomes a weighted
average or expectation over support documents.

Later, Balog et al. [2] proposed a non-probabilistic scoring
scheme by assuming uniform priors over documents and
entities and then simply omitting the division by $|S_e|$, effectively
just adding up context scores, instead of averaging them.
This retains the signal in the absolute support $|S_e|$, also
highlighted as vital by others [26]. (Note that $|E_q|$, the
number of candidate entities for query q, can be ignored 
even in a truly probabilistic framework, as it is fixed for the 
query.) 

Macdonald and Ounis [26] provided one of the first systematic
studies of a space of possible aggregation functions
in collecting evidence from contexts. However, the paradigm 
was restricted to first computing a (fixed, not learnt) score 
for each context, lining up a number of aggregates (such as 
min, max, sum, average, etc.) and then learning a linear 
combination among these. They did not unify voting with 
feature-based proximity scoring. 

Curiously, Macdonald and Ounis [26] found that the "ExpCombMNZ"
aggregate feature, defined as

    $$|S_e| \sum_{x \in S_e} \exp(\mathrm{score}(x, q)),$$    (ExpCombMNZ)

consistently performed best. Here score is any function
used for calculating the match of document x wrt query q.
Standard examples of such functions include BM25 [19] and
TFIDF cosine. This is much more extreme than Balog's
sum: large scores, exponentiated, will overwhelm smaller
scores, and, instead of dividing by $|S_e|$, we multiply. This
effect can also be achieved by a rank cutoff [13, 27] or a
soft-OR aggregation [22].

2.3 L2R based on entity relevance 

Fang et al. [15] propose a noteworthy exception to the
above paradigm: write

    $$\Pr(e \mid q) \propto \sum_{x \in S_e} \Pr(x) \, \Pr(R_{q,x} = 1 \mid q, x) \, \Pr(R_{x,e} = 1 \mid x, e)$$    (1)

using binary relevance variables $R_{q,x}$, $R_{x,e}$. Now
model each component $\Pr(R_{q,x} = 1 \mid q, x)$ and $\Pr(R_{x,e} = 1 \mid x, e)$
as a logistic regression. The formulation is nice in that
it permits training from labeled entities alone; no labeling
of contexts is needed. However, this flexibility results in
a non-convex learning problem. Also, thanks to the $\Pr(x)$
term, the signal in $|S_e|$ is still lost. Their loss function is
itemwise, not pairwise or listwise [24]. (In contrast, ours
is pairwise like RankSVM [18].) Furthermore, Fang et al.
have no mechanism to capture proximity through features.

2.4 Some other related systems 

Some systems for large-scale entity search [11, 22, 10] have
been reported in the database community. Reminiscent of
Macdonald, Cummins and coauthors, EntityRank [11] assumes
additive aggregation of the form $\sum_x p(x) \, \mathrm{score}(e, x, q)$
where x is a page and $p(x)$ its PageRank. Proximity scoring
was hardwired. No learning was involved. EntityEngine
[22] is the only system to have used a soft-or aggregation, but
no feature-based learning was involved. None of these systems
supported open-domain entities; the largest number of
broad entity types supported was 21 [10].

2.5 Overview of our unified framework 

The above picture is somewhat diverse and chaotic, and 
our main goal is to unify all the above efforts in a uniform, 
trainable feature-based discriminative ranking framework. 

A context $x \in S_e$ has an associated (query dependent)
feature vector $f_q(x, e) \in \mathbb{R}^d$; q is dropped if clear from
context. Note that, in general, e is also an input to $f_q$.
E.g., we may find that features for people should be different
from features for places. Or we may use various collective
statistics from $S_e$ inside $f_q$. To keep learning simple, we will
assume the raw score of a context is $w \cdot f_q(x, e)$ where $w \in \mathbb{R}^d$
is the context-level (proximity-cognizant) scoring model
to be trained. Next we must aggregate the raw context
scores into a score for e:

    $$V(e) = \bigoplus \{ T(w \cdot f_q(x, e)) : x \in S_e \},$$    (Aggr)

where $\bigoplus$ is a suitable score aggregation operator. Entities
will be sorted by decreasing $V(e)$ and presented to the user.
(Aggr) is shown pictorially in Figure 1. Here $T : \mathbb{R} \to \mathbb{R}$
is a (usually monotone) transformation such as $T(a) = a$,
$T(a) = \log(1 + a)$, or $T(a) = e^a$. If T is convex and
fast-growing, we get a soft-max effect, whereas if it is concave
(diminishing returns) we get a soft-count effect. For
some scoring schemes, $|S_e|$ may be recovered simply by using
$T(a) = [\![ a > 0 ]\!]$.
Figure 1: Feature and aggregation formula.

Most existing systems can be expressed within the above 
paradigm (perhaps with the training part replaced by hand 
tuning). E.g., in the case of some language models, $w \cdot f_q(x, e)$
can be interpreted as $w_e \cdot f_q(x)$ where $w_e$ encodes a language
model for e and $f_q(x)$ selects the count and position of query
words in context x.
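
To make the paradigm concrete, the following minimal sketch instantiates (Aggr) for the additive case, where the aggregation operator is a plain sum and only the transformation T varies. The function names, the dictionary of transformations, and the toy weights and feature vectors are illustrative assumptions, not part of any system described here.

```python
import math


def dot(w, f):
    return sum(wi * fi for wi, fi in zip(w, f))


# Candidate transformations T; the keys are hypothetical labels.
TRANSFORMS = {
    "sum": lambda a: a,                         # T(a) = a
    "soft_count": lambda a: math.log1p(a),      # T(a) = log(1 + a), concave
    "soft_max": lambda a: math.exp(a),          # T(a) = e^a, convex
    "count": lambda a: 1.0 if a > 0 else 0.0,   # indicator, recovers |S_e|
}


def entity_score(w, contexts, transform):
    """V(e): sum of T(w . f) over the feature vectors of the contexts of e."""
    T = TRANSFORMS[transform]
    return sum(T(dot(w, f)) for f in contexts)


def rank_entities(w, support, transform):
    """Sort entity ids by decreasing V(e); support maps entity -> contexts."""
    return sorted(support,
                  key=lambda e: entity_score(w, support[e], transform),
                  reverse=True)


w = [1.0, 0.5]
support = {
    "e1": [[2.0, 0.0]],                          # one strong context
    "e2": [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],  # three weak contexts
}
```

On this toy input, the linear and counting transformations rank e2 first because of its larger support, whereas the exponential transformation lets the single strong context of e1 dominate.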



3. PROXIMITY FEATURES 

In this section we design $f_q(x, e)$ to reward proximity in a
trainable manner. We will assume the raw score of a context
is $w \cdot f_q(x, e)$ where w is the context-level scoring model to
be trained. Usually, $f_q(x, e) \ge 0$. To the objective functions
that we seek to minimize, we will also add a standard
regularization [8] term of the form $w \cdot w / (2\lambda^2)$, where $\lambda$ is a
hyperparameter that we fit via cross validation.

In what follows, the inverse document frequency or IDF(t)
of a term t is defined as the inverse of the fraction of documents
where the term occurs. (Instead of the inverse we
also tried the negative log [33] but results were not distinguishable.)
This can also be interpreted as the "surprise
value" or perplexity of finding t in a context. Let
$\mathrm{IDF}(q) = \sum_{t \in q} \mathrm{IDF}(t)$.
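
As a small illustration of these definitions, the sketch below computes IDF(t) as the inverse of the fraction of documents containing t, and IDF(q) as the sum over query terms. The function names and the four-document toy corpus are assumptions made for the example; terms that never occur in the corpus are not handled.

```python
def idf_table(documents):
    """Map each term to N / df(t), where df(t) counts documents containing t."""
    n = len(documents)
    df = {}
    for doc in documents:
        for term in set(doc):  # count each term at most once per document
            df[term] = df.get(term, 0) + 1
    return {term: n / count for term, count in df.items()}


def query_idf(query_terms, idf):
    """IDF(q): sum of IDF(t) over the query terms."""
    return sum(idf[t] for t in query_terms)


docs = [
    ["capital", "of", "nigeria"],
    ["capital", "city"],
    ["the", "capital"],
    ["nigeria", "news"],
]
idf = idf_table(docs)
```

Here the rarer term "nigeria" (2 of 4 documents) gets a larger IDF than "capital" (3 of 4 documents).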

3.1 Document vs. snippet 

Evidence from different mentions of an entity in a document
is scarcely independent, so it is common [25] to
choose one mention from each document that is most fa- 
vorable to the entity, i.e., with the largest score. Multiple 
occurrences of query words in the context offer a similar is- 
sue. When the context is a short snippet, ignoring all but 
the match closest to the entity mention is reported to work 
well [8]. 

3.2 Baselines 

The NoProx baseline scores the entire document wrt the 
query without regard to the position/s of entity mention/s. 
TFIDF cosine, BM25, TFIDF-weighted Jaccard similarity, 
or probabilistic language models may be used. Each such 
score can be one feature, and w can combine them suitably. 
It is also common to add a constant feature (value 1, say) 
which allows w to effectively count the number of support 
contexts in $S_e$.

The second baseline, which we expect to be better than 
the first, is to use a proximity kernel [28] with a tuned 
width together with a probabilistic language model. This 
should also approximate well other similar hardwired proximity
scoring schemes [11, 22, 10]. (Note that Lv and Zhai
[25], while using a positional language model, were ranking
documents, not entities, so they do not specify any aggregation
logic.)

3.3 Perplexity-proximity features 

Consider one mention of a candidate entity in a document, 
together with just the closest occurrences of each query word 
that matches in the document. Each matched word t is characterized
by two quantities: the perplexity IDF(t), and the
distance^1 (number of tokens) between the entity mention
and t. We now describe three natural ways to represent the
event of t occurring at distance $\ell$ from the entity mention.

3.3.1 Cumulative perplexity up to distance $\ell$

Suppose there are three query terms $q = \{t_1, t_2, t_3\}$, of which only
$t_2, t_3$ appear in the context, at distances $\ell_2, \ell_3$ from the
mention. Imagine we are plotting a graph: the x-axis is
distance $\ell$, and the y-axis is the sum of $\mathrm{IDF}(t) / \mathrm{IDF}(q)$
for all t matched within distance $\ell$. The plot starts at
(0, 0), and jumps up at any distance where there is a match,
in this example, from 0 to $\mathrm{IDF}(t_2) / \mathrm{IDF}(q)$ at $\ell_2$, then to
$(\mathrm{IDF}(t_2) + \mathrm{IDF}(t_3)) / \mathrm{IDF}(q)$ at $\ell_3$. The final value is the
fraction of query IDF that is matched in the context (1, if
all query terms were found). This forms a normalized feature
space for learning w. We call these IdfUpto features.
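
One way to turn this step curve into a fixed-length feature vector is to sample it at a fixed list of distance thresholds. The sketch below does this; the thresholds, the IDF values, and the function name are assumptions chosen for illustration, since the text does not specify how the curve is discretized.

```python
def idf_upto_features(matches, idf_q, thresholds):
    """Sample the cumulative-IDF step curve.

    matches: (IDF(t), distance) for the closest occurrence of each matched term.
    Returns, for each threshold d, the fraction of IDF(q) matched within d tokens.
    """
    return [
        sum(idf_t for idf_t, dist in matches if dist <= d) / idf_q
        for d in thresholds
    ]


# Hypothetical query with IDF(t1) = 2, IDF(t2) = 3, IDF(t3) = 5, so IDF(q) = 10.
# Only t2 (distance 4) and t3 (distance 12) occur in the context.
features = idf_upto_features([(3, 4), (5, 12)], idf_q=10, thresholds=[1, 5, 10, 20])
```

The resulting vector is non-decreasing in the threshold and ends at 0.8, the fraction of query IDF matched anywhere within 20 tokens.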

3.3.2 Grid features 

Now consider a query term t that matches at distance
$\ell$ from the mention of candidate e. Then $\mathrm{IDF}(t) / \mathrm{IDF}(q) \in [0, 1]$
decides the "perplexity coordinate" of the match, whereas
$\ell$ decides the "proximity coordinate" of the match. In Figure 2
there are two query terms that match. Capital has
lower IDF, but is closer to the candidate Abuja, compared
to Nigeria, which has higher IDF but is farther away.

^1 For simplicity we use absolute distance, but left/right can
also be encoded naturally using signed distance.

[Figure 2: example snippet "...Abuja officially gained its status as the capital of Nigeria."; axis label: Perplexity (IDF)]
Figure 2: Perplexity-proximity grid features.

Each axis is suitably bucketed to fire one feature (i, j) in a
grid of features (which is later flattened to a single index in
a 1-dimensional vector $f_q(x, e)$). Every query word match
results in firing one cell in the feature grid (shown as the
circles). This is a "universal" encoding, without any commitment
on how perplexity and proximity should be combined;
the combination is decided by learning w. We call
these grid features.
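
A minimal sketch of the grid encoding follows. The bucket boundaries, the normalized IDF values assigned to the two matched words, and the choice of counting matches per cell are all assumptions for illustration; rows are indexed so that a larger i means more IDF and a larger j means a closer match.

```python
import bisect

IDF_EDGES = [0.25, 0.5, 0.75]   # assumed bucket boundaries on IDF(t) / IDF(q)
DIST_EDGES = [2, 5, 10, 20]     # assumed bucket boundaries on token distance
N_ROWS = len(IDF_EDGES) + 1
N_COLS = len(DIST_EDGES) + 1


def grid_cell(idf_frac, dist):
    """Return (i, j): larger i means more IDF, larger j means closer."""
    i = bisect.bisect_right(IDF_EDGES, idf_frac)
    j = len(DIST_EDGES) - bisect.bisect_left(DIST_EDGES, dist)
    return i, j


def grid_features(matches):
    """Flatten the grid; each (idf_frac, dist) match fires exactly one cell."""
    f = [0.0] * (N_ROWS * N_COLS)
    for idf_frac, dist in matches:
        i, j = grid_cell(idf_frac, dist)
        f[i * N_COLS + j] += 1.0
    return f


# "capital" is 7 tokens and "Nigeria" 9 tokens after "Abuja" in the snippet;
# the normalized IDF values 0.3 and 0.7 are made up for the example.
f = grid_features([(0.3, 7), (0.7, 9)])
```

With these buckets both words fall into the same distance column but into different IDF rows, so two distinct cells fire.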

Note that w has an element $w_{i,j} \ge 0$ corresponding to
each grid cell. Because our discretization is arbitrary, we
do not expect $w_{i,j}$ to differ much from $w_{i \pm 1, j \pm 1}$. Therefore,
this part of w should not be regularized (only) as $w_{i,j}^2 / (2\lambda^2)$,
but also with a penalty on the differences between the weights
of neighboring cells. Assuming row $i + 1$ means "more IDF" than
row i and column $j + 1$ means "more proximity" than column j,
we may also want to enforce monotonicity constraints of the form

    $w_{i+1,j} \ge w_{i,j}$ and $w_{i,j+1} \ge w_{i,j}$.

3.3.3 Rectangle features 

The above forms of constraints over the perplexity-proximity
grid complicate model training, and can be avoided by a
transformation of the grid features to rectangle features.
As Figure 2 shows, each query term matched in the snippet
fires one corresponding grid feature (i, j) (shown by the two
circles). In the rectangle feature encoding, we also turn on
all cells that have lower IDF or worse proximity. This ensures
that if (i, j) and (i', j') are close together, the features
fired have a large overlap (double-hatched area). Also, the
farther toward the south-east corner (i, j) is, the more features
are fired. Note that rectangle features no longer require the
above constraints; just $w_{i,j} \ge 0$ is enough.
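
The rectangle transformation can be sketched as below, again as an illustration rather than the exact encoding used in the experiments. The sketch assumes that a larger row index i means more IDF and a larger column index j means more proximity, and that overlapping rectangles switch a cell on once (binary features) rather than adding up; both choices are assumptions.

```python
def rectangle_features(cells, n_rows, n_cols):
    """Turn on every cell (ii, jj) with ii <= i and jj <= j for each fired (i, j)."""
    f = [0.0] * (n_rows * n_cols)
    for i, j in cells:
        for ii in range(i + 1):
            for jj in range(j + 1):
                f[ii * n_cols + jj] = 1.0   # binary: overlaps are not double counted
    return f


one_match = rectangle_features([(1, 2)], n_rows=4, n_cols=5)
two_matches = rectangle_features([(1, 2), (2, 1)], n_rows=4, n_cols=5)
better_match = rectangle_features([(2, 3)], n_rows=4, n_cols=5)
```

Because a better match switches on a superset of the cells of a worse one, any non-negative w scores it at least as high, which is why the explicit monotonicity constraints become unnecessary.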

4. EVIDENCE AGGREGATION 

As described in Section 2.5, we use a general expression
for entity value (Aggr) that combines proximity scoring and
evidence aggregation. The parameters inside (Aggr) will be
trained using entity-level relevance judgments.

For query q let $G_q$, $B_q$ be sets of good (relevant) and bad
(irrelevant) entities. We will use g, b for good and bad entities;
$x_+, x_-$ will denote contexts potentially supporting
good and bad entities. Before moving on to our suite of
aggregation learners, we note that one may also attempt to
directly use L2R techniques at a context level. E.g., we can
directly use RankSVM [18] with hinge loss at the context
level, to minimize wrt w the objective



    $$\sum_q \frac{1}{|G_q| |B_q|} \sum_{g, b} \sum_{x_+ \in S_g} \sum_{x_- \in S_b} \max\{0, 1 + w \cdot (f_q(x_-, b) - f_q(x_+, g))\}.$$

Note that, while $|G_q| |B_q|$ is used to normalize the loss across
queries, the loss is not scaled down by $|S_g|$ or $|S_b|$, which
would average out context support (see Section 4.2).

Context-level formulations are impractical to train. In 
our data set, cases of 10^^ context pairs $(x_+, x_-)$ are not
at all rare. Even an efficient stochastic gradient or sub- 
gradient descent method has no hope of dealing with such 
scale without extreme sampling. So some form of direct 
aggregation to entity score is essential. 

Another problem (further motivating Fang et al.'s work 
[15]) is that a context supporting a good entity is not necessarily
an evidence context as judged by a human; the entity
mention and some query terms may be juxtaposed coinci- 
dentally. On the other hand, acquiring context-level supervi- 
sion is orders of magnitude more expensive than entity-level 
relevance judgment. 

4.1 $|S_e|$ baseline

A baseline that is known [26] to be very competitive in our
testbed is to ignore the quality of the match between query
and context altogether (given at least one term overlap) and
simply use the number of supporting contexts, i.e., $w = (1)$,
$f_q(x, e) = (1)$, $\bigoplus = \sum$, and so $V(e) = |S_e|$. All other
aggregations must be compared against this trivial baseline.

4.2 Sum and average 

If we believe x has any signal, and all entity ranking systems
believe so, we should use nontrivial $f_q(x, e)$s. Two
obvious aggregators that suggest themselves are:

    $$V(e) = \frac{1}{|S_e|} \sum_{x \in S_e} w \cdot f_q(x, e)$$    (Avg)

and

    $$V(e) = \sum_{x \in S_e} w \cdot f_q(x, e).$$    (Sum)

Here $T(a) \propto a$. (Sum) mimics Balog's non-probabilistic
formulation [2]. (Avg) is our approximation to the evidence
aggregation done by generative language models (SumProd).
Generally, for the sums above to be meaningful, we want no
cancellation of terms, so we will design $f_q(\cdots) \ge 0$ and
constrain $w \ge 0$. (For all existing systems this is the case.)
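
The practical difference between the two aggregators is easy to see on made-up numbers. In the sketch below, the raw context scores are hypothetical values of the form w . f(x, e); one entity has many moderately good contexts and the other has a single slightly better one.

```python
def v_sum(context_scores):
    """(Sum): add up the raw context scores."""
    return sum(context_scores)


def v_avg(context_scores):
    """(Avg): divide the sum by the number of supporting contexts."""
    return sum(context_scores) / len(context_scores)


widely_supported = [0.5] * 20   # hypothetical entity with 20 supporting contexts
barely_supported = [0.6]        # hypothetical entity with a single context
```

Averaging prefers the barely supported entity, because dividing by the number of contexts removes the support signal, while summing prefers the widely supported one.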

4.3 SoftMax 

Within the context of TREC entity search, Macdonald
et al. [26], Cummins et al. [13] and others have noted that
not all $x \in S_e$ should contribute to $V(e)$. Some of these
matches are high-noise and should be tuned down (over and
above a hopefully low context score itself) or eliminated.
They try to achieve this effect in two different ways. Cummins
et al. implement a soft cutoff as a weighted sum

    $$V(e) = \sum_{x \in S_e} D(x; S_e) \, w \cdot f_q(x, e),$$    (SoftCutOff)

where $D(x; S_e)$ is a contribution weight that may depend on,
e.g., the rank of x within $S_e$. We present a linear program
to learn D in Section 4.7. Macdonald et al. instead favor
high-scoring contexts by formulating

    $$V(e) = \sum_{x \in S_e} T(w \cdot f_q(x, e)),$$    (SoftMax)

where $T(\cdot)$ is a fast-growing function, such as $T(a) = e^a$.
A few high scoring contexts will tend to dominate $V(e)$,
hence the name. (ExpCombMNZ) is even more extreme:
effectively it replaces all scores in $S_e$ by the maximum and
adds them up. We limit ourselves to $T(a) = e^a$.

4.4 SoftOr 

Although experience with TREC expert search is favorable,
it is not clear if/why soft-max is a universally superior
choice. If a few high-quality evidence contexts should
override other supporting context scores, another natural
aggregator readily suggests itself: the soft-or (used in EntityEngine
[22]). The premise here is

    $$\Pr(e \text{ is good} \mid S_e) = 1 - \prod_{x \in S_e} \bigl(1 - \Pr(x \text{ is evidence})\bigr).$$

The standard technique [18] to turn $w \cdot f_q(x, e)$ into a probability
is to use the sigmoid function $T(a) = \sigma(a) = 1/(1 + e^{-a})$:

    $$\Pr(x \text{ is evidence for } e) = \sigma(w \cdot f_q(x, e)).$$

In soft-or, $\bigoplus$ is no longer $\sum$, and we get

    $$V(e) = 1 - \prod_{x \in S_e} \bigl(1 - \sigma(w \cdot f_q(x, e))\bigr).$$    (SoftOr)
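
The soft-or aggregate is straightforward to compute from raw context scores. The following sketch is a direct transcription of the formula; the function names and the example scores are illustrative assumptions, and no care is taken about overflow for extremely negative scores.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def soft_or(raw_scores):
    """V(e) = 1 - prod_x (1 - sigmoid(a_x)), with a_x = w . f(x, e)."""
    none_is_evidence = 1.0
    for a in raw_scores:
        none_is_evidence *= 1.0 - sigmoid(a)
    return 1.0 - none_is_evidence


one_strong = soft_or([5.0])            # a single high-scoring context
three_weak = soft_or([-2.0, -2.0, -2.0])  # several low-scoring contexts
```

A single confident context already pushes the value close to 1, whereas several weak contexts accumulate only slowly.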

4.5 SoftCount 

In both SoftMax and SoftOr, a few large context scores
can override a number of smaller context scores. An opposite
policy, consistent with the observation that $|S_e|$ is a good
scoring scheme by itself, is that all supporting contexts have
some merit, but there is variation in their evidence quality.
This suggests we use a concave T, with a diminishing return
shape, instead of a convex one like $\exp(\cdot)$. This implements a
form of soft counting. We specifically used $T(a) = \log(1 + a)$,
but experience with other forms like $T(a) = a^p$ with $0 < p < 1$
was similar.

4.6 Training w through aggregation 

The earliest L2R formulations seek to minimize the number
of wrongly ordered entity pairs ("pair swaps") $g \in G_q$, $b \in B_q$
such that $V(b) > V(g)$. The number of pair swaps is directly
related to the area under the curve (AUC) measure in
machine learning, and is also related to MAP by one-sided
bounds [18]. Minimizing pair swaps [17, 18] is a simple and
robust L2R approach that remains hard to beat. RankSVM
[18] proposed to train w by minimizing wrt w the pair
swap hinge loss

    $$\sum_q \frac{1}{|G_q| |B_q|} \sum_{g, b} \max\{0, 1 + V(b) - V(g)\}.$$    (HingeLoss)

We will avoid dual solutions and use simple gradient-descent
optimizers by replacing the hinge loss $\max\{0, 1 + V(b) - V(g)\}$
with the continuous and differentiable soft hinge loss

    $$\mathrm{SH}(1 + V(b) - V(g)), \quad \text{where } \mathrm{SH}(a) = \log(1 + e^a).$$
Note that $\mathrm{SH}'(a) = \sigma(a)$, the sigmoid function. The
generic gradient of the above loss wrt w is

    $$\sum_q \frac{1}{|G_q| |B_q|} \sum_{g, b} \sigma(1 + V(b) - V(g)) \left( \frac{\partial V(b)}{\partial w} - \frac{\partial V(g)}{\partial w} \right),$$    (Gradient)

so all that remains is to plug in $V(e)$ and $\partial V(e) / \partial w$ for
all the cases discussed. When $V(e) = \sum_{x \in S_e} T(w \cdot f_q(x, e))$,
this is simply $\partial V(e) / \partial w = \sum_{x \in S_e} T'(w \cdot f_q(x, e)) \, f_q(x, e)$. The
$\bigoplus =$ (SoftOr) case is not additive, but with a little work we
can derive:

    $$\frac{\partial V(e)}{\partial w} = \sum_{x \in S_e} \frac{f_q(x, e)}{1 + e^{-w \cdot f_q(x, e)}} \prod_{x' \in S_e} \frac{1}{1 + e^{w \cdot f_q(x', e)}}.$$

The soft hinge objective is a convex optimization only in
the case $\bigoplus = \sum$ and $T(a) = a$. However, (quasi) Newton
optimizers like LBFGS [23] tend to behave acceptably well
even when given non-convex problems from this domain [21].
Extending our approach to using listwise ranking losses [24]
is a possible direction for future work.
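
For the convex case (sum aggregation with T(a) = a), the soft hinge loss and its gradient for a single query can be sketched as follows. This is a minimal illustration under stated assumptions: the regularization term is omitted, the data layout (each entity as a list of context feature vectors) and all function names are hypothetical, and a real implementation would hand the loss and gradient to an optimizer such as LBFGS instead of evaluating them once.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def soft_hinge(a):
    return math.log1p(math.exp(a))  # SH(a) = log(1 + e^a)


def value(w, contexts):
    """V(e) under (Sum): sum of w . f over the contexts of the entity."""
    return sum(sum(wi * fi for wi, fi in zip(w, f)) for f in contexts)


def value_grad(contexts, dim):
    """dV(e)/dw under (Sum): the sum of the context feature vectors."""
    return [sum(f[k] for f in contexts) for k in range(dim)]


def loss_and_grad(w, good, bad):
    """Soft hinge pair-swap loss and gradient for one query (no regularizer)."""
    dim, norm = len(w), len(good) * len(bad)
    loss, grad = 0.0, [0.0] * dim
    for g in good:
        for b in bad:
            margin = 1.0 + value(w, b) - value(w, g)
            loss += soft_hinge(margin) / norm
            scale = sigmoid(margin) / norm      # SH'(margin) = sigmoid(margin)
            grad_b, grad_g = value_grad(b, dim), value_grad(g, dim)
            for k in range(dim):
                grad[k] += scale * (grad_b[k] - grad_g[k])
    return loss, grad


w = [0.1, 0.2]
good = [[[1.0, 0.0], [1.0, 1.0]]]   # one good entity with two contexts
bad = [[[0.0, 1.0]]]                # one bad entity with one context
loss, grad = loss_and_grad(w, good, bad)
```

A quick sanity check is to compare the analytic gradient against a central finite difference of the loss, which should agree to several decimal places.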

4.7 Training a soft cutoff by rank 

If we assume that w has been trained and fixed, we can set
up a simple linear program (LP) for implementing the kind
of soft cutoff sought by Macdonald et al. [27] and Cummins
et al. [13]. To make it convenient to use an LP, we will revert
from soft hinge to (HingeLoss). For good, bad entity pair
g, b, as in RankSVM, define slack variable $H(g, b) \ge 0$, with
constraint

    $$\forall g, b : \quad H(g, b) \ge 1 + V(b) - V(g);$$

the objective to minimize will be

    $$\sum_q \frac{1}{|G_q| |B_q|} \sum_{g, b} H(g, b).$$

The soft cutoff decay will be modeled using more variables
$D(r)$ where r is a rank. Variables D are constrained by

    $$\forall r : \quad D(r) \ge D(r + 1), \text{ and } D(r) \ge 0.$$

Now that w is fixed, all context scores are also fixed. Let
the rank of context x within $S_e$ be $r_x$. Then we have

    $$V(e) = \sum_{x \in S_e} D(r_x) \, (w \cdot f_q(x, e)),$$

which can be used to express $H(g, b)$ directly in terms of
$D(\cdot)$ and known constants. As in support vector machines,
to limit the overfitting powers of $D(\cdot)$, we tack on to the
objective a regularization term of the form $D(0) / \lambda$ where $\lambda$
is a tuned width parameter in the same sense as $w \cdot w / (2\lambda^2)$
is used in SVM regularization. Summarizing, the objective
will be

    $$\min_{H \ge 0, \, D \ge 0} \; \frac{D(0)}{\lambda} + \sum_q \frac{1}{|G_q| |B_q|} \sum_{g, b} H(g, b)$$

subject to the above constraints. Different entities will have
diverse $|S_e|$. To share a decay profile $D(r)$ across these,
we allocated parameters in $D(\cdot)$ for deciles of ranks. We
tried many other rank bucketing approaches but they did
not affect the results significantly.
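
The sketch below shows how V(e) can be evaluated for a given decile-bucketed decay profile; it does not solve the LP itself. It assumes 0-based ranks with rank 0 the highest-scoring context, and a simple integer rule for mapping a rank to its decile; the function names and the example profile are hypothetical.

```python
def decile(rank, n):
    """Decile bucket (0..9) of a 0-based rank among n contexts."""
    return (10 * rank) // n


def soft_cutoff_value(context_scores, D):
    """V(e) = sum over contexts of D(decile of rank) * score.

    D is a list of 10 non-negative, non-increasing weights, one per decile.
    """
    ranked = sorted(context_scores, reverse=True)
    n = len(ranked)
    return sum(D[decile(r, n)] * s for r, s in enumerate(ranked))


# Hypothetical profile that keeps only the top three deciles of contexts.
D = [1.0, 1.0, 1.0] + [0.0] * 7
v_ten = soft_cutoff_value([1, 5, 1, 4, 3, 2, 1, 1, 1, 1], D)
v_three = soft_cutoff_value([3, 2, 1], D)
```

Since V(e) is linear in the ten entries of D once the context scores are fixed, each constraint on H(g, b) is a linear inequality in D, which is what makes the LP formulation possible.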

5. EXPERIMENTS 



5.1 Data sets and statistics 

We use two data sets and tasks. The first one is the stan- 
dard TREC enterprise track expert search task used in most 
prior work on expert/entity search. The corpus has 331,000 
documents from the W3C Web site. Mentions of persons (experts)
have been annotated (presumably with near-perfect
accuracy) throughout the corpus. The total number of annotations
is 1.6 million. The only type of entity sought is
a person (expert) on a given topic, so queries are just bags
of words. On average, a query involves 680 candidate
experts. Expert labels (relevant/irrelevant) were provided
with the queries. We chose this reference corpus to make
sure our implementation of earlier reported systems is faithful,
with ranking accuracy scores closely matching published
numbers.

But our real interest is in open-domain Web-scale entity
search, in which, as we shall see, competing systems
behave rather differently compared to TREC. Our second
testbed uses a 500 million-page Web corpus from a commercial
search engine. Token spans that are likely entity
mentions are annotated ahead of time with IDs from among
two million entities belonging to over 200,000 types from
YAGO [29]. About eight billion resulting annotations were
then indexed along with text.

The next step was to collect queries with relevance-judged
entities. We used 845 queries from many years of TREC and
INEX. The queries were expressed as natural language questions.
There are two steps to answering these: identify the
answer type from among our 200,000 types, and use (some
of) the other query words to probe the text index. To isolate
these two steps, and to align the task to the TREC task, we
had five people rewrite each query into the two constituents:
the answer type and words/phrases to be matched literally.
Some examples follow:

• The original query What is the name of the vaccine
for chicken pox was labeled as seeking an entity of the
type wordnet_vaccine_104517535 with one or more of
these words matched close by: drug +"chicken pox"
+vaccine.

• Likewise, for the original query Rotary engines were
manufactured by which company, the type sought is
wordnet_manufacturer_108060446 and the keyword
literals may be company +rotary +engine.

The translation was done by proficient search engine users
who use + and quotes properly. This is not unfair, because
the same queries and retrieval algorithms are available
to all competing systems. The queries are available for
anonymous viewing at http://goo.gl/T2Kkp. The five volunteers
also curated positive and negative entity instances
from TREC, INEX and the Web; that data will also be made
available in the public domain.
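
The two-part query representation described above (an answer type plus keyword literals) can be sketched in Python as follows. This is only an illustration: the `EntityQuery` class, its field names and the `parse_entity_query` helper are hypothetical. The code merely separates literals written with a leading + from those without; it does not assume any particular retrieval semantics for +, since the text does not spell them out.

```python
import shlex
from dataclasses import dataclass, field
from typing import List


@dataclass
class EntityQuery:
    """Hypothetical container for a rewritten query."""
    answer_type: str                                       # e.g. a YAGO type identifier
    plus_marked: List[str] = field(default_factory=list)   # literals written with a leading +
    unmarked: List[str] = field(default_factory=list)      # literals written without +


def parse_entity_query(answer_type: str, literals: str) -> EntityQuery:
    """Split a keyword string such as: drug +"chicken pox" +vaccine

    shlex.split keeps a quoted phrase together as one token and removes the quotes.
    """
    query = EntityQuery(answer_type)
    for token in shlex.split(literals):
        if token.startswith("+") and len(token) > 1:
            query.plus_marked.append(token[1:])
        else:
            query.unmarked.append(token)
    return query


q = parse_entity_query("wordnet_vaccine_104517535", 'drug +"chicken pox" +vaccine')
print(q.plus_marked, q.unmarked)
```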

On average a query leads to evaluating 1884 candidate
entities and 110231 contexts, for a total of 93 million
contexts over 845 queries. Figure 3 shows the distribution of
the number of candidate entities per query. Figure 4 shows,
for relevant and irrelevant entities, the number of supporting
contexts. Both plots show heavily skewed behavior. In
particular, from Figure 4 we see that some entities are
enormously more popular on the Web than others. Although
some good entities have huge |S_e|, we also see that
lower down, good and bad entities are well mixed in terms
of |S_e|, and therefore good-bad separation during ranking
remains a challenging problem.

Figure 3: Candidate entities per query.

Figure 4: Distribution of supporting contexts (|S_e|)
per candidate entity, good and bad.

Figure 5 shows context score distributions within some
sampled S_e's for three good and three bad entities. All
entities are fairly mixed together in the chart of context score
vs. context rank. Therefore, as with |S_e| in Figure 4, entities
are not easy to separate on the basis of score alone. Three
major steps/levels are seen, corresponding to presence or
absence of query keywords. Within each broad level, smaller
variations near the edges are because of diverse distances at
which query term matches occur.

5.2 Measurements 

Unless stated otherwise, our uniform evaluation policy was
leave-one-query-out cross-validation. We marked each query
in turn as the test query, and trained our parameters on the
remaining queries. Then we evaluated the trained parameters
on the single test query. Finally we averaged accuracy measures
across all test queries. This is computationally intensive, but
exploits training data maximally and gives a more reliable
estimate. In some cases (SoftMax and SoftOr) we reduced
the computational cost of optimization using standard
five-fold cross-validation across queries. We report entity-level
MAP, MRR, NDCG@5, NDCG@10, and pairs of good/bad
entities that are reversed in rank. The last measure is best
if small; others are best if large.
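
The evaluation protocol above can be sketched in Python as follows, assuming binary relevance labels listed in ranked order for a single query. The function names are illustrative. The sketch further assumes that the ranked list contains every candidate entity (so average precision is normalized by the number of relevant entities in the list), that NDCG uses binary gains with a log2 discount, and that the pair-swap measure is the fraction of good/bad pairs ranked in the wrong order. The paper may define these details differently.

```python
import math
from typing import Iterator, List, Sequence, Tuple


def average_precision(labels: Sequence[int]) -> float:
    """Mean of precision values at the positions of relevant entities."""
    hits, total = 0, 0.0
    for position, relevant in enumerate(labels, start=1):
        if relevant:
            hits += 1
            total += hits / position
    return total / hits if hits else 0.0


def reciprocal_rank(labels: Sequence[int]) -> float:
    """1 / position of the first relevant entity, or 0 if there is none."""
    for position, relevant in enumerate(labels, start=1):
        if relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(labels: Sequence[int], k: int) -> float:
    """NDCG@k with binary gains and a log2(position + 1) discount."""
    def dcg(items: Sequence[int]) -> float:
        return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(items[:k], start=1))
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(list(labels)) / ideal if ideal > 0 else 0.0


def pair_swap_fraction(labels: Sequence[int]) -> float:
    """Fraction of (good, bad) pairs in which the bad entity is ranked above the good one."""
    bad_seen, swaps = 0, 0
    for relevant in labels:
        if relevant:
            swaps += bad_seen  # every bad entity seen so far outranks this good one
        else:
            bad_seen += 1
    good = sum(1 for r in labels if r)
    bad = len(labels) - good
    return swaps / (good * bad) if good and bad else 0.0


def leave_one_query_out(query_ids: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training queries, held-out test query) for each query in turn."""
    for i, held_out in enumerate(query_ids):
        yield query_ids[:i] + query_ids[i + 1:], held_out


ranked_labels = [1, 0, 1, 0]  # toy ranking: good, bad, good, bad
print(average_precision(ranked_labels), reciprocal_rank(ranked_labels),
      ndcg_at_k(ranked_labels, 2), pair_swap_fraction(ranked_labels))
```

Per-query values would then be averaged over all held-out queries to obtain MAP, MRR and the other reported measures.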

5.3 Effect of proximity features 

To study the two interacting policies (features and aggregation),
here we will fix the aggregation policy to our overall
best (unweighted sum of context scores, see Section 5.4), and
vary the design of features.

Figure 5: Distribution of context scores in a few S_e's
for three good and three bad entities.





                     MAP      MRR      NDCG@5   NDCG@10  Pairswap
Only |S_e|           0.542*   0.559*   0.641*   0.660*   0.211*
NoProx               0.559    0.578*   0.661    0.675    0.203
NoProx + IdfUpto     0.560    0.581    0.661    0.672    0.221*
NoProx + rectangle   0.563    0.585    0.656    0.675    0.202

(* marks values that are statistically significantly worse than
the best value in the same column.)



Figure 6: Proximity features compared (Web data). 

Figure 6 compares various proximity features for the Web
corpus. Rectangle features lead to statistically significant
(paired t-test at p = 0.05) improvements over not using
proximity signals. Here, and in all tables comparing different
settings/systems, in each column, the largest value is shown
in boldface, and other quantities in the same column that
are statistically significantly smaller are suffixed with a marker.
Proximity kernels [28], [25] perform worse than the numbers
in Figure 6, but this is in part due to the way probabilistic
language models are built around the kernels (also see
end-to-end comparisons in Section 5.5).
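
For reference, the paired t-test used above compares two systems on the same set of queries by testing whether the mean per-query difference is zero. A minimal Python sketch of the test statistic follows. The function name and the toy per-query scores are invented for illustration and are not results from the paper.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """t = mean(d) / (stdev(d) / sqrt(n)), where d are the per-query score differences."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same queries")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two queries")
    spread = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (spread / math.sqrt(n))


# Invented per-query average precision values for two hypothetical systems.
system_a = [0.6, 0.7, 0.5, 0.9]
system_b = [0.5, 0.5, 0.4, 0.6]
t = paired_t_statistic(system_a, system_b)
print(round(t, 3))
```

The statistic is then compared with the critical value of a t-distribution with n - 1 degrees of freedom at the chosen significance level.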

Figure 6 shows that |S_e| is already a strong signal for
Web data, which concurs with [27]. However, the proximity
features bring out significant gains beyond |S_e|.

Figure 7 repeats the feature comparison for TREC. A
prominent observation is that |S_e| is an excellent single
feature for the Web, but not at all for TREC. Given that
TREC-QA/INEX queries involve entities well-known to the
Web, the "embarrassment of riches" represented in 500 million
documents ensures that retrieved contexts are of generally
high quality, so just counting them up is not too bad. In
contrast, in the much smaller TREC corpus, accidental
similarities between the query and the whole document bring in
a large fraction of poor quality contexts. This is also
confirmed by a later experiment: while the TREC task benefits
from rank-based cutoffs, the Web task does not.

Figure 8 shows the contributions made to a context score
by features firing in each cell shown earlier in Figure 2. If
proximity or IDF had no signal, the result would be a
flat-valued weight grid. Instead we see a visible increase in score
contributions as we go from the (low-IDF, large-distance)
corner toward the (high-IDF, small-distance) corner of the
grid.

Figure 7: Proximity features compared (TREC).

Figure 8: Sample rectangle model weights over the
feature grid.

5.4 Effect of aggregation policies 

In this subsection we fix the feature representation to the 
best reported in the previous subsection, and explore aggre- 
gation schemes. The research questions are: 



• Prior work [13, 27] suggests that contexts in S_e should
not contribute symmetrically to the score of e. Does
SoftMax or SoftOr perform better than a simple (linear)
sum of context scores?

• Can we get additional mileage beyond linear sum by
making T(·) sublinear, i.e., using a SoftCount?

• Can we get improvements by using ranks within S_e to
implement a soft cutoff (see subsection 4.7)?

Figure 9 shows (for Web data) the effect of choosing score
transformer T and aggregator ⊕ in various ways.
Note that w is trained through this choice of T, ⊕
as explained in Section 4 and illustrated in Figure 1. The
top three rows show the discriminative aggregation schemes
where high-scoring contexts in S_e get additional preference.
SoftMax uses exp(w · f_q(x,e)), and SoftOr is as described
in subsection 4.4. SoftCutoff follows subsection 4.7. The
fourth row shows the simple sum Σ_x w · f_q(x,e) with all
contexts treated symmetrically. The fifth uses (Avg) instead of
sum, and the last two rows show sublinear aggregation
(subsection 4.5) and plain |S_e| as a trivial baseline. Figure 10
shows, for TREC, the counterpart of Figure 9.

Figure 9: Effect of T, ⊕ and soft cutoffs (Web).

Linear sum is the clear winner. It is curious that neither
superlinear nor sublinear aggregation beats linear sum.
This could be because linear sum gives a convex optimization
while SoftMax, SoftOr and SoftCount get trapped in local
optima, or because there is something fundamental about
linear sum; this is worth researching further. Also note
that averaging, as against summing, performs poorly, and
|S_e| by itself is not as good as linear sum. Even when scoring
was done using linear sum and the SoftCutoff linear program
was used to remove low-scoring contexts' contributions
to entity scores, accuracy dropped. This lends additional
evidence that symmetric context contribution to entity score
is the best policy.
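
To make the difference between some of these aggregation choices concrete, the following Python sketch scores two toy entities under sum, average, plain count, and an exp-then-sum transform. The exp-then-sum variant is only one reading of the SoftMax description above. SoftOr and SoftCount are omitted because their definitions appear in earlier subsections. All names and numbers are invented for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence

Aggregator = Callable[[Sequence[float]], float]


def aggregate_sum(scores: Sequence[float]) -> float:
    """Linear sum of context scores w . f_q(x, e)."""
    return sum(scores)


def aggregate_avg(scores: Sequence[float]) -> float:
    """Average of context scores; ignores how many contexts support the entity."""
    return sum(scores) / len(scores)


def aggregate_count(scores: Sequence[float]) -> float:
    """Plain |S_e|: the number of supporting contexts."""
    return float(len(scores))


def aggregate_exp_sum(scores: Sequence[float]) -> float:
    """Superlinear variant: T = exp, followed by a sum (an assumed reading of SoftMax)."""
    return sum(math.exp(s) for s in scores)


def rank_entities(contexts: Dict[str, List[float]], aggregator: Aggregator) -> List[str]:
    """Order entities by decreasing aggregated score."""
    return sorted(contexts, key=lambda e: aggregator(contexts[e]), reverse=True)


# Toy data: a well-supported entity versus one with a single, slightly stronger context.
contexts = {
    "well_supported": [1.0, 1.0, 1.0, 1.0],
    "single_context": [1.5],
}
for name, agg in [("sum", aggregate_sum), ("avg", aggregate_avg),
                  ("count", aggregate_count), ("exp_sum", aggregate_exp_sum)]:
    print(name, rank_entities(contexts, agg))
```

In this toy case, summing rewards the entity with more supporting contexts, whereas averaging prefers the entity with one strong context. This is the kind of behavior that separates the Sum and Avg rows, although the real data is of course far noisier.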

5.5 End-to-end comparisons 

Finally, we compare our system's end-to-end accuracy against 
other systems, for both data sets. We compare with these 
prior systems: 

• Balog2 [2], without any proximity signal. 

• Macdonald et al.'s formulation [27], which uses various
combinations of document scoring models, voting
techniques and ranking cutoffs.

• Petkova et al.'s formulation using proximity kernel and 
generative language model [28] . 

Although Lv and Zhai [25] used positional language mod- 
els, they did so for document, not entity ranking. There- 
fore they did not specify the all-important aggregation logic 
needed to turn their system into an entity search system, 
and so we cannot directly compare with them. 

Fang et al.'s formulation [15] makes (probabilistic) annotation
a query-time activity along with score aggregation.
While novel, this approach is not practical at Web scale,
where entities may need to be annotated in millions of snippets
at query time. In both our data sets, annotation is conducted
offline, which effectively turns Fang et al.'s system
into a single logistic regression for context scoring, followed
by an expectation over contexts, which we already know to be
surpassed by ⊕ = Σ.

The approach of Petkova et al. [28] not only suffers from the
same weighted average limitation, but is also impractical to
implement on a Web-scale distributed index. Instead of (SumProd),
Petkova evaluates the kernel over all documents for each entity
mention, then combines them. Therefore we can present its
numbers for TREC alone.

Figure 10: Aggregation choices for TREC.

Figure 11: End to end comparisons (TREC).

Figure 11 shows TREC results. For TREC 2005, Macdonald
is better than Balog2. For TREC 2006, Balog2 is
better than Macdonald. Our system scores slightly lower overall,
but is occasionally the best (NDCG@5 for TREC 05 and MRR
for TREC 06). Petkova implements the "Balog1" model
[equation (3)], known for higher recall and lower precision,
and falls behind the others.

Figure 12 is for the Web data set. Here our system performs
consistently better than all previous approaches tested,
and all differences are significant. Besides lacking our
parameter learning, Balog2 does not exploit proximity. Also, as
Figure 9 (SoftCutoff row) shows, rank-based cutoffs work
worse than symmetric aggregation for the Web, which may
explain why Sum beats Macdonald.

Figure 12: End to end comparisons (Web).

6. CONCLUSION 

We presented a system that unifies diverse, unconnected 
approaches for context scoring and entity-level score aggre- 
gation into a simple, feature-based, trainable, discriminative 
and robust framework for entity ranking. We evaluated our 
system using two data sets. On the TREC data set, we are 
best or close on most evaluation criteria. On the Web data 
set, we are considerably ahead of the competition on all cri- 
teria. The main lessons, some confirming earlier wisdom, 
were: 

• Simple rectangle features, that capture query match 
perplexity and lexical proximity, work better than prox- 
imity kernels in conjunction with probabilistic language 
models. 

• For the Web, |S_e| is a valuable signal; for TREC,
it is not. In all cases, adding more features helped.

• How we aggregate makes or breaks algorithms. In general,
we should sum, not average, evidence. This has
serious implications for probabilistic entity scores that
look like $\sum_x (\cdots)\,\Pr(x|e)$.

• Sublinear (SoftCount) or superlinear (SoftMax) con- 
text score combinations did not yield better ranking 
than a simple linear combination; neither did SoftOr. 
Rank-based asymmetry in score aggregation did not 
help for Web data. 

Part of our contribution is a fully implemented search system
that answers open-domain entity queries in a few seconds,
using two million entities and 200,000 types, executed over
500 million Web pages, soon to be upgraded to two billion
pages. Our code, a demo, and a search API (as a Web
service) will be placed in the public domain.